CJKV GOD - DoYe's Chaos

Recently, an article introducing Smaji CJKV was submitted for review. Several reviewers suggested adding citations and prerequisite information to it. These comments make sense. After all, most disciplines and engineering fields develop by continuous progression, and new work is mostly built on the foundations laid by its predecessors.

Although more information was requested, I initially found it difficult to add. Over the past two decades, techniques for recording, encoding, and designing fonts for variant or rarely used characters have been researched and developed, but they haven’t had much influence. Most of the resulting systems are independent private systems that cannot be integrated into general-purpose systems. Some are relatively open, but only at the user-interface level, while others, though relatively open and standardized, have weak infrastructure and are inadequate as a basis for subsequent development. Such systems did not seem worth citing as references or prerequisite knowledge.

In 1999, Unicode 3.0 introduced Unicode’s own Ideographic Description Characters. A sequence of these characters is called an "Ideographic Description Sequence" (IDS). IDS integrates naturally into everyday general-purpose systems based on Unicode, has a huge user base, and is easy to use. For example, the two characters of the word "时间" can be expressed as "⿰日寸" and "⿵门日" respectively. Even a character as complex as "𰻝", as seen in the word "𰻝𰻝面", can be expressed as "⿺辶⿳穴⿲月⿱⿲幺言幺⿲长马长刂心". At first glance, the functionality is complete.
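Because an IDS is written in prefix notation, it can be read back into a tree with a small recursive parser. The following is a minimal sketch in Python, not part of Smaji CJKV; the names `parse_ids` and `leaves` are made up for illustration, and only the twelve descriptors U+2FF0 to U+2FFB are handled.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB.
# (Assumption: later additions such as ⿼ and ⿽ are left out of this sketch.)
BINARY_IDC = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")
TERNARY_IDC = set("⿲⿳")


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (operator, operand, ...)."""

    def parse_at(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        if ch in TERNARY_IDC:
            arity = 3
        elif ch in BINARY_IDC:
            arity = 2
        else:
            return ch, pos + 1  # any other character is a leaf component
        operands = []
        pos += 1
        for _ in range(arity):
            operand, pos = parse_at(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse_at(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def leaves(tree):
    """Return the leaf components of a parsed IDS, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for operand in tree[1:] for leaf in leaves(operand)]
```

For instance, `parse_ids("⿰日寸")` yields the tuple `("⿰", "日", "寸")`, and the long sequence for "𰻝" parses into a tree whose root operator is "⿺" with "辶" as its first operand.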


But the problem is that when it comes to "丝", IDS cannot decompose it, because Unicode does not include the character that looks like "幺" minus the last dot. Another example is "乔", with "夭" above and "?" below; the lower component is also an unencoded character. The same problem arises when decomposing the following characters: "与","乌","亇","争","亥","以"…

Too many "components" or "roots" have no corresponding encoded character, yet this system requires that both its domain and its range consist entirely of characters included in Unicode. The design was therefore incomplete from the beginning: common, even everyday, characters may fall outside the scope of what can be described.

Other private systems, aware of this problem, relaxed the restriction on the domain and introduced private components. However, the composition of Chinese characters and components is diverse, and IDS and similar systems can only describe certain ideal compositions. A slightly less ideal one, such as "⿻", which means that two components overlap, is ambiguous. How exactly do they overlap? In which direction, and to what degree? None of this is described, so the glyph cannot be restored from the IDS. The result is yet another set of broken and incomplete systems.

However, the review comments also prompted me to reconsider whether the efforts and legacies of the past are still valuable, and whether they can be made useful again after being transformed and refined.

The flaws of past explorations can be summarized as follows:

1. The domain of composite component is limited
2. IDS lacks accuracy
3. Being not universal or narrow in application scenarios

The solution is designed accordingly:

The domain of composite component is limited
The first step is to lift this restriction in a way that does not create new problems. The following conditions must therefore be met:
1. The domain must not be limited to characters included in Unicode, because that set is incomplete.
2. The defined base components must be able to compose any character. Otherwise, the result is just another incomplete system.
3. Base components must not be added, deleted, or modified arbitrarily, so that the composition method does not fail or become unstable.
Given these three requirements, basic strokes are the ideal choice. What we need, however, is not the coarse set of five so-called basic stroke types: we need to enumerate at least 63 basic strokes, together with mirror (left-right, up-down) and rotation operators, because Chinese characters include mirrored and inverted characters.
IDS lacks accuracy

The structures described by IDS conform to a few patterns: the components are stacked vertically (⿱, ⿳), placed side by side horizontally (⿰, ⿲), fully enclosed (⿴), surrounded on three sides (⿵, ⿶, ⿷, ⿼), or surrounded on two adjacent sides (⿸, ⿹, ⿺, ⿽). The components combined by these descriptors all form new shapes. For the separated structures, we only need to take the total height or width and divide it evenly among the components; each component then adjusts its aspect ratio to that average to obtain its new shape. For the surrounding structures, the inner components are best-fitted to the outer component and scaled down slightly.
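The averaging step for the separated structures can be sketched as follows. This is only an illustration under simplifying assumptions: every component gets an equal share of the outer frame, and the names `Frame` and `split_frame` are hypothetical rather than taken from Smaji CJKV.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    """An axis-aligned frame box: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float


def split_frame(outer: Frame, parts: int, horizontal: bool) -> List[Frame]:
    """Divide `outer` into `parts` equal cells, side by side or stacked."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if horizontal:
        # ⿰ and ⿲: components sit left to right, each taking the average width.
        cell = outer.width / parts
        return [Frame(outer.x + i * cell, outer.y, cell, outer.height)
                for i in range(parts)]
    # ⿱ and ⿳: components are stacked, each taking the average height.
    cell = outer.height / parts
    return [Frame(outer.x, outer.y + i * cell, outer.width, cell)
            for i in range(parts)]
```

Calling `split_frame(Frame(0, 0, 128, 128), 2, horizontal=True)` gives two cells of width 64, which is the layout a descriptor such as ⿰ implies when no further information is available.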

The descriptor ⿻ states that its two operands overlap each other, which breaks this framing. The shapes of the components can therefore no longer be used as the basis for calculating their arrangement. Moreover, the descriptors (IDC) and their operands (strokes or roots) carry no other intrinsic basis for calculation, which leaves this description system unable to place the components.

Therefore, we have to introduce additional information to fill the gap. With the separating and enclosing descriptors, the shapes of the described components are preserved, and so is the shape of their combination; the outer frame box of the combined components serves as their outer frame. Several kinds of data are involved: the size and position of the outer frame, and the size and position of each component after it is embedded in the outer frame. In the end we obtain the position and size of every component, with the outer frame as the origin of the coordinate system.

Once the descriptor ⿻ invalidates the component shapes, neither the outer frame nor the position and size of the components can be calculated. What we need to supply, then, is the position and size of each component; the outer frame can in turn be derived from them as the best-fit frame box.
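Deriving the outer frame from explicitly placed components amounts to taking the smallest box that contains all of them. A minimal sketch, assuming each component is given as an axis-aligned box; `Frame` and `best_fit_frame` are illustrative names, not an existing API.

```python
from typing import Iterable, NamedTuple


class Frame(NamedTuple):
    """An axis-aligned frame box: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float


def best_fit_frame(frames: Iterable[Frame]) -> Frame:
    """Return the smallest frame that contains every given component frame."""
    frames = list(frames)
    if not frames:
        raise ValueError("at least one component frame is required")
    left = min(f.x for f in frames)
    top = min(f.y for f in frames)
    right = max(f.x + f.width for f in frames)
    bottom = max(f.y + f.height for f in frames)
    return Frame(left, top, right - left, bottom - top)
```

With two components at `Frame(0, 0, 56, 112)` and `Frame(76, 0, 56, 112)`, the best-fit outer frame is `Frame(0, 0, 132, 112)`.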

To describe plane position and size information, we need to introduce a plane coordinate system.

The description of plane coordinates is a topic worth expanding on, and we will discuss it later. Now, let’s take a look at defect 3.
Being not universal or narrow in application scenarios

The Unicode character set is required to be a standard set for information interchange, so character components or roots must be selected from its own repertoire. That repertoire does not cover all the essential basic components. In addition, the descriptive capability of Unicode’s own Ideographic Description Characters (IDC) is incomplete. This resulted in defects 1 and 2.

However, universities, technical groups, and commercial organizations outside the Unicode Consortium have also tried to design or implement systems that are both Unicode-compatible and complete in descriptive capability. Most of them come close to completeness, some are incompatible with Unicode, and few are perfect, which limits their application scenarios.

Another important reason is that the requirements for flexibility and timeliness are hard to meet. For example, a scholar may need to quote excerpts from an ancient book in which several characters have multiple variants that are not included in the standard. Or an ancient book is newly unearthed, and characters never seen before appear in it. Such characters must be introduced into the standard, and our computer systems must be updated, before they can be encoded and displayed properly.

Meeting these requirements means going through a long Unicode standardization process that may ultimately fail, which will certainly delay the writing of the article.

The solution to this defect is given in Smaji CJKV, so I won’t go into details.

In fact, Smaji CJKV had no plan to design a glyph description system at the beginning; only bitmap or vector images could be submitted. Designing a description system became possible once the core system was in place and remained compatible with Unicode. The reviewers’ suggestion to add supplementary information, mentioned earlier, made me rethink my past experience, and the design of the glyph description language began.

Well, let’s solve the problems skipped before:

IDS lacks accuracy

The idea and method to solve this problem require more space to describe, so the following subsection is added.

Glyph Outline Description Language

Because the standard form of this language is an XML document, an XML Schema Definition (XSD) is the most suitable way to describe it. The syntax of the language is defined in the document god.xsd.

Create XML document

An XML document consists of an optional XML declaration, an optional document type declaration, and a document (root) element.

The version declaration of an XML document ensures that future changes to XML will not affect the syntax and semantics of this document. The encoding declaration tells the XML processor which encoding the document uses. A GOD 1.0 document uses XML version 1.0 and the UTF-8 encoding, so its XML declaration is fixed:

<?xml version="1.0" encoding="UTF-8"?>

Because the XML version defaults to 1.0, and an XML processor can detect UTF-8 or UTF-16 without an encoding declaration, the declaration above is not strictly necessary.

<?xml version="1.0"?>
<god version="1.0"
  xmlns="http://cjkv.smaji.org/ns/god"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://cjkv.smaji.org/ns/god http://cjkv.smaji.org/xml/1.0/xsd/god.xsd">
  <glyph unicode="516b,0">
    <stroke type="t" x="0" y="0" width="56" height="112"/>
    <stroke type="p" x="76" y="0" width="56" height="112"/>
  </glyph>
</god>

The first line is an optional XML declaration.
Lines 2 and 10 open and close the god root element, which is mainly used to indicate the version of this god document. The version attribute on line 2 indicates that the document adopts the syntax and semantics of version 1.0.
Lines 4 and 5 are optional. They reference the XSD for this god document so that capable text editors can validate the document being edited and offer suggestions such as auto-completion.

The next child element is glyph. It has a required attribute, unicode, which indicates the Unicode scalar value of the glyph described in this god document. Its value is a hexadecimal number representing the Unicode scalar value; a value called a variation selector can be appended after the number, separated by a comma. In the example, the scalar part of the unicode attribute is 516b, which is the Unicode scalar value of the Chinese character 「八」.
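Splitting such an attribute value into its two parts is straightforward. The sketch below is an assumption about how a consumer might do it, not code from Smaji CJKV: `parse_unicode_attr` is a hypothetical helper, and the variation selector is kept as raw text because its exact format is defined by god.xsd rather than by this sketch.

```python
from typing import Optional, Tuple


def parse_unicode_attr(value: str) -> Tuple[int, Optional[str]]:
    """Split a value such as "516b,0" into (scalar value, variation selector)."""
    scalar_text, separator, selector = value.partition(",")
    scalar = int(scalar_text.strip(), 16)  # the scalar part is hexadecimal
    # Unicode scalar values exclude the surrogate range D800..DFFF.
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {scalar_text!r}")
    # The selector is returned untouched; None means it was omitted.
    return scalar, (selector.strip() if separator else None)
```

For the example above, `parse_unicode_attr("516b,0")` returns `(0x516B, "0")`, and `chr(0x516B)` is 「八」.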

「八」 consists of two strokes: the first is a throw (撇) and the second is a press (捺). In the glyph element we therefore add two child elements, stroke t (撇) and stroke p (捺), and give the position, width, and height of each stroke in the coordinate system. For more information about the stroke types in god, please consult the god.xsd file.
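A god document of this shape can be read with any namespace-aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `read_strokes` is hypothetical, and treating the four geometry attributes as integers is an assumption based on the example values.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared on the god root element.
GOD_NS = {"god": "http://cjkv.smaji.org/ns/god"}

SAMPLE = """<?xml version="1.0"?>
<god version="1.0" xmlns="http://cjkv.smaji.org/ns/god">
  <glyph unicode="516b,0">
    <stroke type="t" x="0" y="0" width="56" height="112"/>
    <stroke type="p" x="76" y="0" width="56" height="112"/>
  </glyph>
</god>"""


def read_strokes(document: str):
    """Return (type, x, y, width, height) for every stroke in the document."""
    root = ET.fromstring(document)
    strokes = []
    for glyph in root.findall("god:glyph", GOD_NS):
        for stroke in glyph.findall("god:stroke", GOD_NS):
            # Assumption: geometry attributes are plain integers.
            geometry = (int(stroke.get(name))
                        for name in ("x", "y", "width", "height"))
            strokes.append((stroke.get("type"), *geometry))
    return strokes
```

`read_strokes(SAMPLE)` returns `[("t", 0, 0, 56, 112), ("p", 76, 0, 56, 112)]`, one tuple per stroke in document order.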

The following table is an excerpt from god.xsd for reference.


h     | Horizontal
sh    | Slanted Horizontal
u     | Upward horizontal
du    | Dot – Upward horizontal
v     | Vertical
sv    | Slanted Vertical
rsv   | Right Slanted Vertical
t     | Throw
ft    | Flat Throw
wt    | Wilted Throw
d     | Dot
ed    | Extended Dot
ld    | Left Dot
wd    | Wilted Dot
p     | Press
up    | Upward horizontal – Press
hp    | Horizontal – Press
fp    | Flat Press
ufp   | Upward horizontal – Flat Press
c     | Clockwise curve
a     | Anticlockwise curve
o     | Oval
hj    | Horizontal – J hook
uj    | Upward horizontal – J hook
ht    | Horizontal – Throw
hsv   | Horizontal – Slanted Vertical
hv    | Horizontal – Vertical
hvj   | Horizontal – Vertical – J hook
htj   | Horizontal – Throw – J hook
utj   | Upward horizontal – Throw – J hook
hvh   | Horizontal – Vertical – Horizontal
hvu   | Horizontal – Vertical – Upward horizontal
ha    | Horizontal – Anticlockwise curve
haj   | Horizontal – Anticlockwise curve – J hook
hpj   | Horizontal – Press – J hook
htaj  | Horizontal – Throw – Anticlockwise curve – J hook
htc   | Horizontal – Throw – Clockwise curve
htht  | Horizontal – Throw – Horizontal – Throw
htcj  | Horizontal – Throw – Clockwise curve – J hook
hvhv  | Horizontal – Vertical – Horizontal – Vertical
hthtj | Horizontal – Throw – Horizontal – Throw – J hook
vu    | Vertical – Upward horizontal
vh    | Vertical – Horizontal
va    | Vertical – Anticlockwise curve
vaj   | Vertical – Anticlockwise curve – J hook
vhv   | Vertical – Horizontal – Vertical
vht   | Vertical – Horizontal – Throw
vhtj  | Vertical – Horizontal – Throw – J hook
vj    | Vertical – J hook
vc    | Vertical – Clockwise curve
vcj   | Vertical – Clockwise curve – J hook
tu    | Throw – Upward horizontal
th    | Throw – Horizontal
td    | Throw – Dot
wtd   | Wilted Throw – Dot
tht   | Throw – Horizontal – Throw
thtj  | Throw – Horizontal – Throw – J hook
tj    | Throw – J hook
cj    | Clockwise curve – J hook
fpj   | Flat Press – J hook
pj    | Press – J hook
thtaj | Throw – Horizontal – Throw – Anticlockwise curve – J hook
tod   | Throw – Oval – Dot
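The compound abbreviations in this excerpt appear to be concatenations of a small set of primitive codes, so a full name can be rebuilt by greedy longest-match tokenization. The sketch below rests on that observation only; `PRIMITIVES` and `expand_stroke_type` are illustrative names, and the function is not a validator, since it also accepts combinations that are not among the 63 stroke types.

```python
# Primitive codes read off the table above.
PRIMITIVES = {
    "rsv": "Right Slanted Vertical",
    "sh": "Slanted Horizontal",
    "sv": "Slanted Vertical",
    "ft": "Flat Throw",
    "wt": "Wilted Throw",
    "ed": "Extended Dot",
    "ld": "Left Dot",
    "wd": "Wilted Dot",
    "fp": "Flat Press",
    "h": "Horizontal",
    "u": "Upward horizontal",
    "v": "Vertical",
    "t": "Throw",
    "d": "Dot",
    "p": "Press",
    "c": "Clockwise curve",
    "a": "Anticlockwise curve",
    "o": "Oval",
    "j": "J hook",
}
# Longest codes first, so that "rsv" wins over "sv" and "sh" over "h".
CODES_BY_LENGTH = sorted(PRIMITIVES, key=len, reverse=True)


def expand_stroke_type(abbr: str) -> str:
    """Expand an abbreviation such as "hvj" into its dash-separated full name."""
    parts = []
    pos = 0
    while pos < len(abbr):
        for code in CODES_BY_LENGTH:
            if abbr.startswith(code, pos):
                parts.append(PRIMITIVES[code])
                pos += len(code)
                break
        else:
            raise ValueError(f"unknown stroke code at position {pos} in {abbr!r}")
    return " – ".join(parts)
```

For example, `expand_stroke_type("hvj")` gives "Horizontal – Vertical – J hook", matching the corresponding row of the table.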


Table 1. Inherited names of CJK basic and compound strokes (63 items)
Chinese name	Abbr form	Full name	Name in Unicode	Example
橫	H	Horizontal	H	三言隹花
斜橫	SH	Slanted Horizontal	(H)	七弋宅戈
挑	U	Upward horizontal	T	刁求虫地
點挑	DU	Dot – Upward horizontal	(T)	冰冷汗汁
豎	V	Vertical	S	十圭川仆
斜豎	SV	Slanted Vertical	(S)	丑五亙貫
右斜豎	RSV	Right Slanted Vertical	(S)	𠙴
撇	T	Throw	P	竹大乂勿
扁撇	FT	Flat Throw	(P)	千乏禾斤
直撇	WT	Wilted Throw	SP	九厄月几
點	D	Dot	D	主卜夕凡
長點	ED	Extended Dot	(D)	囪囟这凶
左點	LD	Left Dot	(D)	心忙恭烹
直點	WD	Wilted Dot	(D)	六文宇空
捺	P	Press	N	人木尺冬
挑捺	UP	Upward horizontal – Press	TN	文廴父爻
橫捺	HP	Horizontal – Press	(TN)	入八內全
扁捺	FP	Flat Press	(N)	走足廴麵
挑扁捺	UFP	Upward horizontal – Flat Press	(TN)	之乏巡迴
彎	C	Clockwise curve	W
曲	A	Anticlockwise curve	X
圈	O	Oval	Q	〇㔔㪳㫈
橫鈎	HJ	Horizontal – J hook	HG	冧欠冝蛋
挑鈎	UJ	Upward horizontal – J hook	(HG)	也乜池馳
橫撇	HT	Horizontal – Throw	HP	夕水登令
橫斜	HSV	Horizontal – Slanted Vertical	(HP)	今彔互恆
橫豎	HV	Horizontal – Vertical	HZ	口己臼典
橫豎鈎	HVJ	Horizontal – Vertical – J hook	HZG	而永印令
橫撇鈎	HTJ	Horizontal – Throw – J hook	(HZG)	勺方力母
挑撇鈎	UTJ	Upward horizontal – Throw – J hook	(HZG)	也乜池馳
橫豎橫	HVH	Horizontal – Vertical – Horizontal	HZZ	凹兕卍雋
橫豎挑	HVU	Horizontal – Vertical – Upward horizontal	HZT	殼鸠说计
橫曲	HA	Horizontal – Anticlockwise curve	HZW	朵沿殳没
橫曲鈎	HAJ	Horizontal – Anticlockwise curve – J hook	HZWG	九几凡亢
橫捺鈎	HPJ	Horizontal – Press – J hook	(HZWG)	風迅飛凰
橫撇曲鈎	HTAJ	Horizontal – Throw – Anticlockwise curve – J hook	HXWG	乙氹乞乭
橫撇彎	HTC	Horizontal – Throw – Clockwise curve	---	過过這这
橫撇橫撇	HTHT	Horizontal – Throw – Horizontal – Throw	HZZP	延建巡及
橫撇彎鈎	HTCJ	Horizontal – Throw – Clockwise curve – J hook	HPWG	陳陌那耶
橫豎橫豎	HVHV	Horizontal – Vertical – Horizontal – Vertical	HZZZ	凸𡸭𠱂𢫋
橫撇橫撇鈎	HTHTJ	Horizontal – Throw – Horizontal – Throw – J hook	HZZZG	乃孕仍盈
豎挑	VU	Vertical – Upward horizontal	ST	卬氏衣比
豎橫	VH	Vertical – Horizontal	SZ	山世匡直
豎曲	VA	Vertical – Anticlockwise curve	SW	區亡四匹
豎曲鈎	VAJ	Vertical – Anticlockwise curve – J hook	SWG	孔已亂也
豎橫豎	VHV	Vertical – Horizontal – Vertical	SZZ	鼎亞吳卐
豎橫撇	VHT	Vertical – Horizontal – Throw	(SZZ)	奊捑𠱐𧦮
豎橫撇鈎	VHTJ	Vertical – Horizontal – Throw – J hook	SZWG	弓弟丐弱
豎鈎	VJ	Vertical – J hook	SG	小水到寸
豎彎	VC	Vertical – Clockwise curve	SWZ	肅嘯蕭瀟
豎彎鈎	VCJ	Vertical – Clockwise curve – J hook	---	𨙨𨛜𨞠𨞰
撇挑	TU	Throw – Upward horizontal	PZ	去公玄鄉
撇橫	TH	Throw – Horizontal	(SZ)	互母牙车
撇點	TD	Throw – Dot	PD	巡兪巢粼
直撇點	WTD	Wilted Throw – Dot	(PD)	女如姦㜢
撇橫撇	THT	Throw – Horizontal – Throw	(SZZ)	夨𠨮专砖
撇橫撇鈎	THTJ	Throw – Horizontal – Throw – J hook	(SZWG)	巧亟污號
撇鈎	TJ	Throw – J hook	PG	乄
彎鈎	CJ	Clockwise curve – J hook	WG	狗豸豕象
扁捺鈎	FPJ	Flat Press – J hook	BXG	心必沁厯
捺鈎	PJ	Press – J hook	XG	弋戈我銭
撇橫撇曲鈎	THTAJ	Throw – Horizontal – Throw – Anticlockwise curve – J hook	---	𠃉𦲳𦴱鳦
撇圈點	TOD	Throw – Oval – Dot	---	𡧑𡆢

After being processed by the glyph outline generation program provided by Smaji CJKV, the following outline file is generated, which can be used in a font editor.

In god, glyphs can be composed not only of strokes but also of existing characters. For example, the character "丕" can be composed of the character "不" plus "一".

<?xml version="1.0"?>
<god version="1.0" xmlns="http://cjkv.smaji.org/ns/god">
  <glyph unicode="4e15,0">
    <ref unicode="4e0d" x="0" y="0" width="128" height="120"/>
    <stroke type="h" x="0" y="114" width="128" height="14"/>
  </glyph>
</god>

Of course, although using the Unicode scalar value directly is precise, typing the character itself is also a very good choice for commonly used, unambiguous characters. The god file above can therefore be rewritten by changing line 4 to

<character utf8="不" x="0" y="0" width="128" height="120"/>

This gives the following god file:

<?xml version="1.0"?>
<god version="1.0" xmlns="http://cjkv.smaji.org/ns/god">
  <glyph unicode="4e15,0">
    <character utf8="不" x="0" y="0" width="128" height="120"/>
    <stroke type="h" x="0" y="114" width="128" height="14"/>
  </glyph>
</god>
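The two spellings name the same component, which a short script can confirm. This is a sketch under the assumption that `ref` carries a hexadecimal scalar value and `character` carries exactly one literal character; `component_code_points` is a hypothetical helper, not part of Smaji CJKV.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace-uri}local-name".
NS = "{http://cjkv.smaji.org/ns/god}"

TEMPLATE = """<god version="1.0" xmlns="http://cjkv.smaji.org/ns/god">
  <glyph unicode="4e15,0">
    {component}
    <stroke type="h" x="0" y="114" width="128" height="14"/>
  </glyph>
</god>"""
BY_REF = TEMPLATE.format(
    component='<ref unicode="4e0d" x="0" y="0" width="128" height="120"/>')
BY_CHARACTER = TEMPLATE.format(
    component='<character utf8="不" x="0" y="0" width="128" height="120"/>')


def component_code_points(document: str):
    """List the code points of the existing characters that a glyph reuses."""
    root = ET.fromstring(document)
    found = []
    for glyph in root.iter(NS + "glyph"):
        for child in glyph:
            if child.tag == NS + "ref":
                # Keep only the scalar part if a selector follows a comma.
                found.append(int(child.get("unicode").split(",")[0], 16))
            elif child.tag == NS + "character":
                found.append(ord(child.get("utf8")))
    return found
```

Both `component_code_points(BY_REF)` and `component_code_points(BY_CHARACTER)` return `[0x4E0D]`, the code point of "不".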

The following glyph outlines can be produced:

Let’s take a look at another glyph outline:

Doesn’t it look like "了" turned upside down? Indeed, Chinese characters include left-right mirrored characters, up-down mirrored characters, and rotated characters. The character illustrated is a rotated one. So how is it described in god?

<?xml version="1.0"?>
<god version="1.0" xmlns="http://cjkv.smaji.org/ns/god">
  <glyph unicode="2010f,0" transform="rotate180">
    <character utf8="了" x="0" y="0" width="88" height="128" />
  </glyph>
</god>

One of the design concepts of god is that, for Chinese characters after Liding (隶定) and Libian (隶变), composition is a combination of basic components and strokes rather than a manipulation of them. Therefore, mirroring and rotating operations apply only to a character as a whole.

We can therefore add a transform attribute to the glyph element. Its value indicates the transformation and is one of:

mirror_horizontal
mirror_vertical
rotate180

The glyph of Unicode 2010f is exactly the character "了" rotated by 180 degrees. So in this god file, the third line sets the transform attribute to rotate180, and the fourth line brings in the glyph of the character "了" as the basis. The required glyph is thus obtained.
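Geometrically, the three transform values are simple reflections of coordinates inside the glyph frame. The sketch below shows one possible reading, assuming that mirror_horizontal flips left and right and mirror_vertical flips top and bottom; the article does not spell this out, and `transform_point` is a hypothetical helper.

```python
def transform_point(x, y, width, height, transform):
    """Map a point inside a width-by-height frame under a whole-glyph transform."""
    if transform == "mirror_horizontal":
        return width - x, y            # assumed: left-right flip
    if transform == "mirror_vertical":
        return x, height - y           # assumed: top-bottom flip
    if transform == "rotate180":
        return width - x, height - y   # equivalent to applying both flips
    raise ValueError(f"unknown transform: {transform!r}")
```

In an 88 by 128 frame, `transform_point(10, 20, 88, 128, "rotate180")` returns `(78, 108)`, and applying rotate180 a second time brings the point back to `(10, 20)`.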

Smaji CJKV support for GOD

Smaji Glyph Outline

An OCaml library for reading, exporting, and converting glyph outline data and files.

Supported glyph outline formats are:

SVG, Scalable Vector Graphics. It is extremely widely used and supports an unusually rich range of vector graphics features.
GLIF, the Glyph Interchange Format used by the Unified Font Object (UFO).


2024-02-16 Fri cjkv cjkv / unicode / god
